This is the Introduction to Open Data Science course: IODS.
“Open data and content can be freely used, modified, and shared by anyone for any purpose.” (opendefinition.org). However, to make use of all the open data available and to scrutinize it properly, we need basic skills, tools and plenty of motivation. This course familiarizes participants with R, RStudio, R Markdown and GitHub. Along the way, everyone learns some statistics, too. Every week there is a lecture and scheduled exercises. The interactive platforms DataCamp, MOOC and GitHub make learning flexible, and the enthusiastic supervisory team keeps spirits up even in times of desperation.
I ♥ DATA.
So let’s get started!
My GitHub repository can be found here, and course updates can be followed in my diary posts here.
My research-oriented activities are scrutinized here.
I am a self-taught R user and really enjoy learning new things with it. R Markdown and GitHub are completely new to me. R Markdown seemed logical, but with GitHub it was not love at first sight. Some extra (and very concrete) headache was, is and will potentially be caused by my outdated computational equipment.
I decided to start with DataCamp. It was fun and encouraging, just like the swirl package. I have my own mysterious ways of figuring things out, but such approaches give me a more structured way of learning and memorizing things. Cool!
Finally I got everything running and could upload my files to my GitHub repository. However, I got a little worried: perhaps I am not skilled enough for this. But, as is said at Cornell:
This week’s analysis exercise focuses on linear regression: carrying it out, interpreting the results and validating the fitted model. The exercise builds on a study of learning approaches and student achievement in an introductory statistics course in Finland; the original dataset, learning2014, includes 183 responses and 60 variables. The original data can be found here and its metadata here. Briefly, the dataset describes students’ Likert-scaled (1-5) ratings of deep, surface and strategic learning approaches, attitudes towards statistics, exam points and gender. More specifically, the deep approach refers to an intention to maximize understanding, the surface approach to memorizing without understanding, and the strategic approach to the ways students organize their studying.
The original dataset needed to be edited for the exercise 2 analysis according to this R script. First, the variables related to deep, surface and strategic learning were combined and averaged, and the variable attitude was rescaled. Thereafter, observations with zero exam points were excluded. The final dataset includes seven variables (gender, age, attitude, deep, stra, surf and points) from 166 respondents.
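The wrangling steps described above can be sketched as follows. This is a minimal illustration of the technique only: the toy data frame and the question names (D1, SU1, ST1, …) are placeholders, not the real question names, which are listed in the course metadata.

```r
library(dplyr)

# Toy rows standing in for the raw survey data; column names are
# illustrative placeholders for the real Likert questions.
raw <- data.frame(
  gender = c("F", "M", "F"),
  Age = c(21, 25, 30),
  Attitude = c(30, 25, 14),            # sum of ten Likert items
  D1 = c(4, 3, 5), D2 = c(5, 2, 4),    # deep-learning questions
  SU1 = c(2, 3, 1), SU2 = c(3, 4, 2),  # surface-learning questions
  ST1 = c(3, 3, 4), ST2 = c(4, 2, 3),  # strategic-learning questions
  Points = c(25, 0, 30)
)

learning2014 <- raw %>%
  mutate(
    attitude = Attitude / 10,                 # rescale back to the 1-5 scale
    deep = rowMeans(select(raw, D1, D2)),     # average the question groups
    surf = rowMeans(select(raw, SU1, SU2)),
    stra = rowMeans(select(raw, ST1, ST2))
  ) %>%
  filter(Points > 0) %>%                      # drop zero-point exams
  select(gender, age = Age, attitude, deep, stra, surf, points = Points)

dim(learning2014)  # two rows remain after excluding the zero-point exam
```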
Data is read in and the structure of the dataset is checked.
# Read in the dataset
learning2014<-read.csv(file="learning2014.csv", header=TRUE)
# Inspect the structure
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166 7
Descriptive statistics are shown both as a preliminary summary and as a summary stratified by gender to investigate whether there are clear differences between males and females. The dataset is further explored using the pairs function for basic scatter plots and the ggpairs command (male = blue, female = red), which produces scatter plots, distributions and correlations.
Basic summaries of the variables
#Basic summaries
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
Grouped summaries by gender: the CreateTableOne command from the tableone package simultaneously performs the group comparisons.
#Grouped summaries (CreateTableOne comes from the tableone package)
library(tableone)
CreateTableOne(vars=c("age", "attitude", "deep", "stra", "surf", "points"), strata=c("gender"), data=learning2014)
## Stratified by gender
## F M p test
## n 110 56
## age (mean (sd)) 24.85 (7.36) 26.80 (8.43) 0.127
## attitude (mean (sd)) 2.99 (0.73) 3.44 (0.65) <0.001
## deep (mean (sd)) 3.66 (0.53) 3.72 (0.59) 0.457
## stra (mean (sd)) 3.20 (0.75) 2.96 (0.80) 0.061
## surf (mean (sd)) 2.83 (0.46) 2.70 (0.64) 0.148
## points (mean (sd)) 22.33 (5.83) 23.48 (6.01) 0.234
Preliminary scatter plots
#Scatter plots
#pairs draws directly and does not return a plot object
pairs(learning2014[-1], col=learning2014$gender)
A more advanced plot matrix:
#Matrix of plots (ggpairs comes from the GGally package)
library(GGally)
p <- ggpairs(learning2014, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)), title="Figure 2. Graphical overview of the learning2014 data")
p
From the output it can be seen that, apart from age, all of the variables are more or less normally distributed. The distributions are similar for both genders apart from attitude: females have lower average scores than males, 2.99 (0.73) versus 3.44 (0.65) (mean (sd)).
The cross-correlation table shows that the strongest correlation is between exam points and attitude. Moreover, both the surface and the strategic learning approach correlate slightly with exam points. The other correlations are not noteworthy. Splitting by gender reveals some potentially interesting differences in the distributions; however, to keep the scope within the exercise requirements, these are not investigated further.
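A cross-correlation table like the one discussed above can be produced with cor() on the numeric columns, gender excluded. A self-contained sketch with a toy frame mimicking the layout of learning2014:

```r
# Toy frame with the same column layout idea: a factor first, numerics after.
df <- data.frame(gender = c("F", "M", "F", "M"),
                 attitude = c(3.7, 3.1, 2.5, 3.5),
                 points = c(25, 12, 24, 10))

# Pairwise correlations of the numeric variables only.
cors <- round(cor(df[, -1]), 2)
cors
```

On the real data the call is the same: `round(cor(learning2014[, -1]), 2)`.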
A multiple linear regression model with three explanatory variables (attitude, strategic learning and surface learning), selected based on the preliminary inspection of the dataset, is fitted to investigate their relationship with the dependent outcome variable, final exam points.
#First model with three explanatory variables
mod<-lm(points~attitude+stra+surf,data=learning2014)
summary(mod)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The summary report of the fitted model shows that attitude has a positive impact on the final exam points. The estimates for strategic and surface learning prove to be non-significant.
Two further models are fitted, first excluding surf and then stra. As the variables remain non-significant at the 0.05 level (data not shown), a final model is fitted with only one explanatory variable: attitude.
#Final model with only one explanatory variable
mod2<-lm(points~attitude,data=learning2014)
summary(mod2)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Thus, the regression model \(y=3.5255x_{1}+11.6372\), R² = 0.1906, where \(y=\)points and \(x_{1}=\)attitude, is chosen. According to the model, a one-point increase in attitude increases the exam points by about 3.5. The final model has a slightly smaller R-squared value (0.1906) than the first one (0.2074), indicating that the model with the single predictor attitude explains approximately 19% of the variation in students’ exam points.
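The slope interpretation above can be checked by hand from the reported coefficients; this tiny helper simply evaluates the fitted line (coefficients taken from the summary output above):

```r
# Hand-rolled prediction from the reported coefficients:
# points = 11.6372 + 3.5255 * attitude
predict_points <- function(attitude) 11.6372 + 3.5255 * attitude

# A one-unit increase in attitude adds exactly the slope, 3.5255 points.
predict_points(4) - predict_points(3)
```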
The variable estimates might be biased if the model does not actually fit the data. Thus, the model fit is explored here using diagnostic plots: Residuals vs Fitted values, a normal Q-Q plot and Residuals vs Leverage.
#Diagnostic plots for the final model
par(mfrow=c(2,2))
plot(mod2,which=c(1,2,5))
First, the model assumes that the size of the errors does not depend on the explanatory variables. This constant-variance assumption seems to hold well, as seen when the residuals are plotted against the fitted values. Second, the Q-Q plot implies that the deviations of the residuals from the theoretical normal distribution are minimal. Third, the standardized residuals plotted against leverage reveal no single observation with unusually high influence on the model.
All these assumptions appear valid.
Data including grades and demographic, social and school-related features were collected in two Portuguese schools using school reports and questionnaires, and stored as two separate datasets on performance in two distinct subjects, Mathematics and Portuguese.
The original data of the analysis in this exercise are freely available as a zip file. Additional metadata describes the basic characteristics of the data sets. For the purpose of this particular exercise the two data sets needed to be joined and edited according to this R script. The variables not used for joining were combined by averaging. An additional variable alc_use was created as the average of alcohol consumption during weekdays and weekends, and from it a logical TRUE/FALSE variable high_use was derived. The threshold for higher-than-average alcohol consumption was an average consumption score greater than 2.
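The derivation of alc_use and high_use described above can be sketched as follows, using toy rows in place of the joined data (Dalc and Walc are the weekday and weekend consumption variables visible in the glimpse output below):

```r
library(dplyr)

# Toy rows standing in for the joined data set.
alc <- data.frame(Dalc = c(1, 2, 3), Walc = c(1, 3, 5))

alc <- alc %>%
  mutate(alc_use = (Dalc + Walc) / 2,  # average weekday/weekend use
         high_use = alc_use > 2)       # TRUE above the threshold of 2

alc$high_use  # FALSE TRUE TRUE
```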
The data set is read in and its structure is checked.
# Inspect the structure (glimpse comes from the dplyr package)
library(dplyr)
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob <fctr> teacher, other, other, services, other, other, oth...
## $ reason <fctr> course, course, other, home, home, reputation, hom...
## $ nursery <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
The final data set includes 382 respondents and 35 variables, both integer and factor. The names of the variables are listed below (explanations can be found here):
#Names of the variables
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The purpose is to investigate the relationship between high alcohol consumption and four chosen explanatory variables, selected based on their assumed impact and personal interest:
| Variable | Explanation | Type | Range | Scale |
|---|---|---|---|---|
| sex | Gender | factor (binary) | F/M | female/male |
| absences | Number of school absences | integer | 0-93 | hours |
| famrel | Quality of family relationships | integer | 1-5 | very bad - excellent |
| health | Current health status | integer | 1-5 | very bad - very good |
Briefly, I assume that men drink more, that more school absences are associated with drinking habits, that students with worse family relationships drink more, and that better health is associated with less alcohol consumption. The specific hypotheses are:

1. Males are more likely than females to belong to the high users of alcohol (average consumption score above 2).
2. Students with more school absences drink more than those attending school more regularly.
3. Students with very good family relationships consume less alcohol than students with bad family relationships.
4. Very good health status is protective against high alcohol consumption compared to very bad health status.
Let's take a first look graphically: distributions and correlations, as well as bar and box plots grouped by high alcohol use, to better understand the relationship between the chosen variables and the outcome. To place multiple figures on the same plot I use the great multiplot script.
#Matrix of plots (requires the GGally and ggplot2 packages)
library(GGally)
library(ggplot2)
ggpairs(alc_study[-1], mapping = aes(col = sex), lower = list(combo = wrap("facethist", bins = 20)), title="Graphical overview of the 4 variables")
p1 <- ggplot(alc_study, aes(sex)) + geom_bar(aes(fill = high_use), position = "dodge", stat="count") + xlab('Gender')+ggtitle("Gender") +scale_fill_manual(values=c("lightblue","red"))+
theme(plot.title = element_text(size=9))
p2 <- ggplot(alc_study, aes(x=high_use,y= absences,fill=high_use)) + geom_boxplot(outlier.colour="black", outlier.shape=16,
outlier.size=2, notch=FALSE) +
stat_summary(fun.y=mean, geom="point", shape=23, size=4)+ggtitle("Absences (means included)")+
scale_fill_manual(values=c("lightblue","red"))+
theme(plot.title = element_text(size=9))
p3 <- ggplot(alc_study, aes(x=high_use,y= famrel,fill=high_use)) + geom_boxplot(outlier.colour="black", outlier.shape=16,
outlier.size=2, notch=FALSE) +
stat_summary(fun.y=mean, geom="point", shape=23, size=4)+ggtitle("Family relationships (means included)")+
scale_fill_manual(values=c("lightblue","red"))+
theme(plot.title = element_text(size=9))
p4 <- ggplot(alc_study, aes(x=high_use,y= health,fill=high_use)) + geom_boxplot(outlier.colour="black", outlier.shape=16,
outlier.size=2, notch=FALSE) +
stat_summary(fun.y=mean, geom="point", shape=23, size=4)+ggtitle("Health status (means included)")+
scale_fill_manual(values=c("lightblue","red"))+
theme(plot.title = element_text(size=9))
multiplot(p1, p2, p3, p4, cols = 2)
In addition to the visualization, descriptive statistics are shown numerically as a summary grouped by high_use to get a preliminary, quantitative idea of the data set. The CreateTableOne command from the tableone package simultaneously performs the group comparisons.
#Summary
CreateTableOne(vars=c("sex","absences","famrel","health"), strata=c("high_use"),factorVars=c("sex"),data=alc_study )
## Stratified by high_use
## FALSE TRUE p test
## n 268 114
## sex = M (%) 112 (41.8) 72 (63.2) <0.001
## absences (mean (sd)) 3.71 (4.45) 6.37 (6.99) <0.001
## famrel (mean (sd)) 4.00 (0.89) 3.78 (0.93) 0.027
## health (mean (sd)) 3.52 (1.40) 3.70 (1.40) 0.242
Proportionally, there are more male high users than female. Family relationships are on average better among those not using a lot of alcohol (4.00 vs 3.78), although there is no difference in the medians (4.00 vs 4.00). Both the mean (6.37) and the median (4.00) number of absences are higher in the high-use group than among students consuming less than the average amount (3.71 and 3.00, respectively). There are also quite a few outliers in both groups. Surprisingly, the high-use group has slightly higher current health scores (mean 3.70, median 4.00) than the non-high-use group (3.52, 4.00), but the differences are small.
Thus, based on both the graphical and numerical overviews, it seems that all of the variables except health are at least to some extent associated with the level of alcohol consumption, as hypothesized. Inspection also showed no relevant correlations among the chosen explanatory variables. To explore the associations further, a logistic regression analysis is carried out.
A logistic regression model with the four chosen explanatory variables is fitted to identify the ones related to higher-than-average student alcohol consumption. Logistic regression is the basic approach for binary outcomes, and its results provide estimates of the probability of the event happening, P(Y=1).
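Concretely, the fitted probabilities come from applying the inverse logit to the model's linear predictor; a tiny numeric illustration of that mapping:

```r
# Inverse logit: maps a linear predictor eta onto a probability in (0, 1).
inv_logit <- function(eta) 1 / (1 + exp(-eta))

inv_logit(0)         # log-odds of zero correspond to a 50/50 event
inv_logit(c(-2, 2))  # probabilities symmetric around 0.5
```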
#First model with four explanatory variables
m1<-glm(high_use~sex+absences+famrel+health,data=alc_study,family="binomial")
summary(m1)
##
## Call:
## glm(formula = high_use ~ sex + absences + famrel + health, family = "binomial",
## data = alc_study)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2557 -0.8609 -0.6151 1.0698 2.1157
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.94806 0.58165 -1.630 0.1031
## sexM 1.00440 0.24498 4.100 4.13e-05 ***
## absences 0.09375 0.02264 4.141 3.45e-05 ***
## famrel -0.30625 0.12864 -2.381 0.0173 *
## health 0.08331 0.08779 0.949 0.3426
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 423.95 on 377 degrees of freedom
## AIC: 433.95
##
## Number of Fisher Scoring iterations: 4
The summary of the model confirms what could already be suspected from the plots and the grouped numerical summary. The variables sex, absences and famrel are significant at the 5% level, whereas the effect of health is not even borderline significant (p = 0.3426). The estimates also show that the association between family relationships and high_use is negative and the association between school absences and high_use is positive. Furthermore, being male is positively associated with the outcome, i.e. it increases the risk of being a high consumer of alcohol.
To understand the odds ratio, which logistic regression uses to quantify the relationship between an explanatory factor and the target variable, the term odds should first be clarified: the odds of an event are its probability divided by the probability of it not happening. The ratio of two such odds is, logically, the odds ratio. An odds ratio above 1 means the factor is positively associated with the event happening; below 1 it implies a negative or, if the event is unwanted (e.g. death), protective association. To complicate matters further, in the logistic regression model the linear predictor models not the odds but the log of the odds. Thus, in the following table, the odds ratios are obtained by exponentiating the summary estimates, and the profile-likelihood confidence intervals are calculated with the confint command from the MASS package:
#Odds ratios and profile-likelihood confidence intervals
OR <- exp(coef(m1))
CI <- exp(confint(m1))
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.3874923 0.1218232 1.2018295
## sexM 2.7302676 1.6999004 4.4499308
## absences 1.0982852 1.0527088 1.1506747
## famrel 0.7362043 0.5708862 0.9470728
## health 1.0868775 0.9165274 1.2940911
Apart from the variable health, the confidence intervals do not include one, which, as suspected, indicates statistically significant factors. If a confidence interval included 1, the factor could have either a positive or a negative effect, or no effect at all, on the risk of the outcome.
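To make the odds definition concrete, a tiny numeric check involving no model at all:

```r
# Odds = p / (1 - p); an odds ratio compares two such odds.
odds <- function(p) p / (1 - p)

odds(0.5)               # an even (50/50) event has odds 1
odds(0.75) / odds(0.5)  # odds ratio of 3: the first event is far likelier
```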
Next, the non-significant variable health is excluded and the model is refitted.
#Second model with three explanatory variables
m2<-glm(high_use~sex+absences+famrel,data=alc_study,family="binomial")
summary(m2)
##
## Call:
## glm(formula = high_use ~ sex + absences + famrel, family = "binomial",
## data = alc_study)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1998 -0.8427 -0.6062 1.0845 2.1356
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.71798 0.52862 -1.358 0.1744
## sexM 1.02877 0.24341 4.227 2.37e-05 ***
## absences 0.09351 0.02266 4.127 3.68e-05 ***
## famrel -0.29090 0.12744 -2.283 0.0225 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 424.86 on 378 degrees of freedom
## AIC: 432.86
##
## Number of Fisher Scoring iterations: 4
Now all of the variables in the model are significant at the 5 % level.
#Odds ratios with profile-likelihood confidence intervals for the second model
Oddsratios_Confint <- cbind(OR = exp(coef(m2)), exp(confint(m2)))
Oddsratios_Confint
## OR 2.5 % 97.5 %
## (Intercept) 0.4877368 0.1702903 1.3635949
## sexM 2.7976292 1.7476499 4.5467291
## absences 1.0980228 1.0524564 1.1504519
## famrel 0.7475910 0.5811204 0.9595347
The odds ratios show that for the factor variable sex, males have 2.8 (95% CI 1.7 - 4.5) times the odds of being high consumers of alcohol compared with females. (On the log-odds scale, the linear predictor for males is the intercept plus the sexM estimate.) With increasing absences a student is more likely to be a high alcohol consumer, and conversely, the better the family relationships, the less likely a student is to be a high consumer of alcohol.
More specifically, to interpret both the summary and the odds-ratio table, consider the coefficient for absences, 0.09351. It is the expected change in the log-odds for a one-unit increase in the number of absences. Exponentiating it gives the odds ratio, 1.098, meaning that we expect roughly a 10% increase in the odds of being a high alcohol user for each one-unit increase in absences.
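The arithmetic behind that interpretation, using the coefficient reported in the model summary above:

```r
# Exponentiating the log-odds coefficient for absences gives its odds ratio.
b_absences <- 0.09351
exp(b_absences)  # about 1.098, i.e. roughly 10% higher odds per extra absence
```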
The second model is then used to explore its predictive power:
probabilities<-predict(m2,type="response")
predictions<-probabilities >0.5
# Add the probabilities
alc_study <- mutate(alc_study, probability = probabilities)
# Calculate a logical high use value based on the probabilities.
alc_study <- mutate(alc_study, prediction = probability > 0.5)
table(high_use=alc_study$high_use,prediction=predictions) %>% addmargins %>% round(digits=2)
## prediction
## high_use FALSE TRUE Sum
## FALSE 256 12 268
## TRUE 85 29 114
## Sum 341 41 382
table(high_use=alc_study$high_use,prediction=predictions) %>% prop.table %>% addmargins() %>% round(digits=2)
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67 0.03 0.70
## TRUE 0.22 0.08 0.30
## Sum 0.89 0.11 1.00
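The counts in the cross-tabulation above translate directly into the standard classification metrics; a quick check in R (counts taken from the table):

```r
# Confusion-matrix counts from the cross-tabulation above.
tn <- 256; tp <- 29; fn <- 85; fp <- 12
n  <- tn + tp + fn + fp

accuracy    <- (tp + tn) / n   # share of correct predictions
sensitivity <- tp / (tp + fn)  # high users correctly flagged
specificity <- tn / (tn + fp)  # non-high users correctly passed

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

Note that the accuracy of about 0.746 matches one minus the 25.4% training error reported below.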
There are 256 true negatives and 29 true positives, but 85 false negatives and 12 false positives. In other words, the model predicts high consumption less often than it actually occurs, as can be demonstrated graphically:
hu <- as.data.frame(prop.table(table(alc_study$high_use)))
pred <- as.data.frame(prop.table(table(alc_study$prediction)))
pp1 <- ggplot(hu, aes(Var1, Freq)) + geom_col(aes(fill = Var1)) + scale_y_continuous(limits = 0:1) + ylab('frequency') + xlab('OBSERVED high use') + theme(legend.position = 'none')+
scale_fill_manual(values=c("lightblue","green"))+
theme(
plot.title = element_text(color="green", size=14, face="bold.italic"),
axis.title.x = element_text(color="green", size=9, face="bold"),
axis.title.y = element_text(color="#993333", size=14, face="bold")
)
pp2 <- ggplot(pred, aes(Var1, Freq)) + geom_col(aes(fill = Var1)) + scale_y_continuous(limits = 0:1) + ylab('frequency') + xlab('PREDICTED high use') + theme(legend.position = 'none')+
scale_fill_manual(values=c("lightblue","orange"))+
theme(
plot.title = element_text(color="red", size=14, face="bold.italic"),
axis.title.x = element_text(color="orange", size=8, face="bold"),
axis.title.y = element_text(color="#993333", size=14, face="bold")
)
multiplot(pp1, pp2, cols = 2)
Since we can make predictions with the model, we can also compute the average proportion of incorrect predictions. For our model it is 25.4%, suggesting that the model lacks accuracy. Logistic regression aims to minimize the number of incorrectly classified observations. Still, the performance of the model is better than pure guessing, i.e. tossing a coin, which predicts either outcome with a probability of 50%.
#loss_func from DataCamp exercises
loss_func<-function(class,prob){
n_wrong<-abs(class-prob)>0.5
mean(n_wrong)
}
loss_func(class = alc_study$high_use, alc_study$probability)#Training error of the three variable model
## [1] 0.2539267
Cross-validation tests the model on unseen data, i.e. data not used for fitting the model. The lower the error, the more accurate the model. Cross-validation can also be used to compare models.
#Errors computed and stored (cv.glm comes from the boot package)
library(boot)
cv <- cv.glm(data = alc_study, cost = loss_func, glmfit = m2, K = 10)
testerror<-cv$delta[1]
A ten-fold cross-validation shows that, on average, approximately 26% of the observations are misclassified under our model with the three explanatory variables sex, famrel and absences. The average proportion of wrong predictions in cross-validation is thus about the same as in the DataCamp exercise (26%).
However, if the goal of a study like this were to identify, and thereafter control, risk factors for high alcohol use, we would not succeed very well: the model is clearly still missing relevant risk factors. Thus, I try to improve it by adding another variable, age.
mbetter<-glm(high_use~famrel+absences+sex+age,family=binomial,data=alc)
#Errors computed and stored
cvbetter <- cv.glm(data = alc, cost = loss_func, glmfit = mbetter, K = 10)
cvbetter$delta[1] # Print the average number of wrong predictions.
## [1] 0.2565445
Adding age to the model hardly improves the performance; the test error remains essentially unchanged.
Next, a data set with twelve explanatory variables (eleven integer variables and the factor sex) is used as a whole to try to improve the predictive accuracy.
alcs<-alc[c("age","Medu","Fedu","failures","G3","sex","absences","famrel","health","goout","studytime","freetime","high_use")]
full<-glm(high_use~.,family=binomial,data=alcs)
#Errors computed and stored
cvfull <- cv.glm(data = alcs, cost = loss_func, glmfit = full, K = 10)
cvfull$delta[1] # Print the average number of wrong predictions.
## [1] 0.2329843
The full model based on this data set still has a reasonably high prediction error.
I choose not to compare model performance based solely on the number of variables, but rather on variable selection methods. For a purposeful selection and model comparison, stepAIC() is used first and bestglm() thereafter. More information about the selection methods can be found in an article here.
library(leaps)
library(MASS)
library(bestglm)
scope<-stepAIC(full,scope=list(lower=~sex+absences+famrel,upper=full),trace=FALSE)
scope$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## high_use ~ age + Medu + Fedu + failures + G3 + sex + absences +
## famrel + health + goout + studytime + freetime
##
## Final Model:
## high_use ~ sex + absences + famrel + goout + studytime
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 369 369.5715 395.5715
## 2 - G3 1 0.02581712 370 369.5973 393.5973
## 3 - failures 1 0.32319318 371 369.9205 391.9205
## 4 - freetime 1 0.60688507 372 370.5274 390.5274
## 5 - Fedu 1 0.55002583 373 371.0774 389.0774
## 6 - Medu 1 0.16331742 374 371.2407 387.2407
## 7 - age 1 1.06533948 375 372.3061 386.3061
## 8 - health 1 1.67027019 376 373.9764 385.9764
bestglm(alcs,IC="BIC",family=binomial)
## Morgan-Tatar search since family is non-gaussian.
## BIC
## BICq equivalent for q in (0.268888789905018, 0.514103143659991)
## Best Model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.76826342 0.66169773 -4.183577 2.869578e-05
## sexM 1.01234164 0.25894570 3.909475 9.249710e-05
## absences 0.08167686 0.02199598 3.713263 2.046039e-04
## famrel -0.39378406 0.14034903 -2.805748 5.019988e-03
## goout 0.76761101 0.12316480 6.232389 4.593740e-10
Based on the outputs of the two approaches, the best models are fitted. The stepAIC model includes gender, absences, family relationships, going out with friends and study time; the bestglm model includes gender, absences, family relationships and going out with friends.
#alcsfull, alcsAIC and alcsbest hold the predictions of the full, stepAIC
#and bestglm models, computed as for alc_study above
hu <- as.data.frame(prop.table(table(alcs$high_use)))
predfull <- as.data.frame(prop.table(table(alcsfull$prediction)))
predAIC <- as.data.frame(prop.table(table(alcsAIC$prediction)))
predbest <- as.data.frame(prop.table(table(alcsbest$prediction)))
ppobs <- ggplot(hu, aes(Var1, Freq)) + geom_col(aes(fill = Var1)) + scale_y_continuous(limits = 0:1) + ylab('frequency') + xlab('Observed ') + theme(legend.position = 'none')+
scale_fill_manual(values=c("lightblue","green"))+
theme(
plot.title = element_text(color="green", size=14, face="bold.italic"),
axis.title.x = element_text(color="green", size=8, face="bold"),
axis.title.y = element_text(color="#993333", size=14, face="bold")
)
ppfull <- ggplot(predfull, aes(Var1, Freq)) + geom_col(aes(fill = Var1)) + scale_y_continuous(limits = 0:1) + ylab('frequency') + xlab('Full model') + theme(legend.position = 'none')+
scale_fill_manual(values=c("lightblue","orange"))+
theme(
plot.title = element_text(color="red", size=14, face="bold.italic"),
axis.title.x = element_text(color="orange", size=8, face="bold"),
axis.title.y = element_text(color="#993333", size=14, face="bold")
)
ppAIC <- ggplot(predAIC, aes(Var1, Freq)) + geom_col(aes(fill = Var1)) + scale_y_continuous(limits = 0:1) + ylab('frequency') + xlab('BestAIC model ') + theme(legend.position = 'none')+
scale_fill_manual(values=c("lightblue","darkgreen"))+
theme(
plot.title = element_text(color="red", size=14, face="bold.italic"),
axis.title.x = element_text(color="darkgreen", size=8, face="bold"),
axis.title.y = element_text(color="#993333", size=14, face="bold")
)
ppbest<- ggplot(predbest, aes(Var1, Freq)) + geom_col(aes(fill = Var1)) + scale_y_continuous(limits = 0:1) + ylab('frequency') + xlab("Bestglm model") + theme(legend.position = 'none')+
scale_fill_manual(values=c("lightblue","blue"))+
theme(
plot.title = element_text(color="blue", size=14, face="bold.italic"),
axis.title.x = element_text(color="blue", size=8, face="bold"),
axis.title.y = element_text(color="#993333", size=14, face="bold")
)
multiplot(ppobs, ppfull, ppAIC, ppbest,cols = 4)
| Model | Training error | Test error |
|---|---|---|
| FULL model with 12 variables | 0.2094241 | 0.2329843 |
| AIC model with 5 variables | 0.2172775 | 0.2225131 |
| BEST model with 4 variables | 0.2041885 | 0.2198953 |
The last model uses the best subsets approach, which aims to find the best-fitting model among all possible subset models. It has only four explanatory variables: sex, absences, family relationships and going out with friends. Using cross-validation it yields a test error of just below 22%, i.e. it predicts almost 78% of cases correctly. However, there are still 14 false positives and 64 false negatives.
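A hedged sketch of how such a cross-validated test error is typically obtained with `boot::cv.glm`. Since `alcs` and `mbest` are defined elsewhere in the analysis, the procedure is illustrated here on the built-in `mtcars` data instead:

```r
# Hedged sketch: K-fold cross-validated prediction error for a logistic
# model, illustrated on built-in mtcars (not the course data).
library(boot)
m <- glm(am ~ wt + hp, data = mtcars, family = binomial)
# cost: share of observations where the predicted probability misses the class
loss <- function(obs, prob) mean(abs(obs - prob) > 0.5)
set.seed(42)
cv <- cv.glm(data = mtcars, cost = loss, glmfit = m, K = 10)
cv$delta[1]  # estimated test error (a proportion between 0 and 1)
```

The first element of `delta` is the raw cross-validation estimate of the prediction error; the second is a bias-corrected version.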
probabilities<-predict(mbest,type="response")
predictions<-probabilities >0.5
# Add the probabilities
alcbest <- mutate(alcs, probability = probabilities)
# Calculate a logical high use value based on probabilities.
alcbest <- mutate(alcbest, prediction = probability > 0.5)
table(high_use=alcbest$high_use,prediction=predictions) %>% addmargins %>% round(digits=2)
## prediction
## high_use FALSE TRUE Sum
## FALSE 254 14 268
## TRUE 64 50 114
## Sum 318 64 382
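The BEST model's training error in the comparison table above can be recomputed directly from this confusion matrix:

```r
# training error = proportion of misclassified observations,
# read off the confusion matrix above
fp <- 14   # observed FALSE, predicted TRUE (false positives)
fn <- 64   # observed TRUE, predicted FALSE (false negatives)
n  <- 382
(fp + fn) / n  # 0.2041885, the BEST model's training error
```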
To conclude, the results of the final model are presented as odds ratios with the corresponding confidence intervals:
stargazer(mbest, coef = list(OR.vector), ci = T,
ci.custom = list(CI.vector), p = list(p.values),
single.row = T,align=TRUE, type = "html",
title="Logistic regression model: Variables related to alcohol consumption",column.labels=c("odds ratio(confint)"),
covariate.labels=c("Gender(M)","School absences","Family relationships","Going out with friends"),
dep.var.caption = "Model with the lowest testing error",
dep.var.labels = "Dependent variable:More than average alcohol usage" )
| Model with the lowest testing error | |
| Dependent variable:More than average alcohol usage | |
| odds ratio(confint) | |
| Gender(M) | 2.752*** (1.668, 4.613) |
| School absences | 1.085*** (1.040, 1.135) |
| Family relationships | 0.674*** (0.511, 0.887) |
| Going out with friends | 2.155*** (1.703, 2.764) |
| Constant | 0.063*** (0.016, 0.223) |
| Observations | 382 |
| Log Likelihood | -189.904 |
| Akaike Inf. Crit. | 389.809 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
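The stargazer call above relies on `OR.vector`, `CI.vector` and `p.values`, which are computed earlier in the analysis (not shown here). A hedged sketch of how such vectors are typically derived from a fitted glm, illustrated on the built-in `mtcars` data rather than the course data:

```r
# Hedged sketch (not the author's exact code): odds ratios and
# confidence intervals for a logistic model, shown on built-in mtcars.
m <- glm(am ~ wt + hp, data = mtcars, family = binomial)
OR.vector <- exp(coef(m))      # coefficients on the odds-ratio scale
CI.vector <- exp(confint(m))   # profile-likelihood CIs, exponentiated
p.values  <- summary(m)$coefficients[, 4]
round(cbind(OR = OR.vector, CI.vector, p = p.values), 3)
```

Exponentiating both the coefficients and the confidence limits keeps the interval consistent with the odds-ratio scale.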
Ref.: Fabio Pagnotta and Hossain Mohammad Amran, Using Data Mining to Predict Secondary School Student Alcohol Consumption (2008), Department of Computer Science, University of Camerino.
Cluster analysis is one of the main tasks of exploratory data mining and is the topic of this week's exercise. Clustering techniques identify groups (clusters) of observations such that members within a cluster are more similar to each other than to observations in other clusters. Defining what "similar" means, however, often requires a lot of contextual knowledge and creativity.
The Boston data will be used. It can be loaded from the R package MASS. According to ?Boston, the data frame has 506 rows (observations) and 14 columns (variables). Briefly, the data report several variables potentially explaining housing values around Boston. Our aim is to classify the suburbs in the Boston data set into classes based on their characteristics. The variables are:
crim: per capita crime rate by town
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox: nitrogen oxides concentration (parts per 10 million)
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted mean of distances to five Boston employment centres
rad: index of accessibility to radial highways
tax: full-value property-tax rate per $10,000
ptratio: pupil-teacher ratio by town
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat: lower status of the population (percent)
medv: median value of owner-occupied homes in $1000s
Firstly, the data are loaded, glimpsed and thereafter summaries are printed.
# load data
data("Boston")
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
The descriptive summary includes skewness (a measure of the asymmetry of a distribution) and kurtosis (a measure of its tail-heaviness), and it already shows the quantile values of crim that will later be used to categorize the crime rate.
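As a reminder, the moment-based estimates of these two shape measures are (up to small-sample corrections, which may differ between packages):

\[ \text{skew} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^{3}, \qquad \text{kurt} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^{4} - 3 \]

where the kurtosis is reported as excess kurtosis, i.e. zero for a normal distribution.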
tab1<-CreateTableOne(vars=c("crim","zn", "indus","chas","nox" ,"rm"
,"age","dis" ,"rad" , "tax" ,"ptratio", "black"
,"lstat" , "medv"), factorVars = c("rad", "chas"),data=Boston)
summary(tab1)
##
## ### Summary of continuous variables ###
##
## strata: Overall
## n miss p.miss mean sd median p25 p75 min max skew
## crim 506 0 0 3.6 8.6 0.3 8e-02 3.7 6e-03 89.0 5.2
## zn 506 0 0 11.4 23.3 0.0 0e+00 12.5 0e+00 100.0 2.2
## indus 506 0 0 11.1 6.9 9.7 5e+00 18.1 5e-01 27.7 0.3
## nox 506 0 0 0.6 0.1 0.5 4e-01 0.6 4e-01 0.9 0.7
## rm 506 0 0 6.3 0.7 6.2 6e+00 6.6 4e+00 8.8 0.4
## age 506 0 0 68.6 28.1 77.5 5e+01 94.1 3e+00 100.0 -0.6
## dis 506 0 0 3.8 2.1 3.2 2e+00 5.2 1e+00 12.1 1.0
## tax 506 0 0 408.2 168.5 330.0 3e+02 666.0 2e+02 711.0 0.7
## ptratio 506 0 0 18.5 2.2 19.1 2e+01 20.2 1e+01 22.0 -0.8
## black 506 0 0 356.7 91.3 391.4 4e+02 396.2 3e-01 396.9 -2.9
## lstat 506 0 0 12.7 7.1 11.4 7e+00 17.0 2e+00 38.0 0.9
## medv 506 0 0 22.5 9.2 21.2 2e+01 25.0 5e+00 50.0 1.1
## kurt
## crim 37.13
## zn 4.03
## indus -1.23
## nox -0.06
## rm 1.89
## age -0.97
## dis 0.49
## tax -1.14
## ptratio -0.29
## black 7.23
## lstat 0.49
## medv 1.50
##
## =======================================================================================
##
## ### Summary of categorical variables ###
##
## strata: Overall
## var n miss p.miss level freq percent cum.percent
## chas 506 0 0.0 0 471 93.1 93.1
## 1 35 6.9 100.0
##
## rad 506 0 0.0 1 20 4.0 4.0
## 2 24 4.7 8.7
## 3 38 7.5 16.2
## 4 110 21.7 37.9
## 5 115 22.7 60.7
## 6 26 5.1 65.8
## 7 17 3.4 69.2
## 8 24 4.7 73.9
## 24 132 26.1 100.0
##
To get a better idea of the variables and their distributions some plots are generated.
#density plots for numerical variables
Boston %>%
keep(is.numeric) %>% # keep only numeric columns
gather() %>% # convert to key-value pairs
ggplot(aes(value)) + # plot the values
facet_wrap(~ key, scales = "free") + # in separate panels
geom_density() # as density
#histograms for integer variables
Boston %>%
keep(is.integer) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(bins=10)
Given the variable characteristics (percentages, proportions or function-based values), it is understandable that most of them have rather uneven or skewed distributions: e.g. black (scaled proportion of black residents), indus (proportion of non-retail business acres), age (proportion of owner-occupied units built prior to 1940), zn (proportion of land zoned for very large lots) and lstat (lower status of the population, percent). In contrast, dwelling size measured by the number of rooms (rm) is normally distributed, and the median value of owner-occupied homes (medv) can also be judged to be. The Charles River dummy variable (chas) is binary (1/0), indicating whether the tract bounds the river, and the radial highways accessibility (rad) is an interval-scaled index.
#to get a better idea about variable crim it is plotted separately
plot(Boston$crim, col="red",pch=8, main="Crime rate per capita")
text(198,59," Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01 0.08 0.26 3.61 3.68 88.98" )
ggplot(Boston, aes(x = crim)) +
stat_density(aes(col="red"),position="identity",geom="line",size=2)+
ggtitle("Crime rate per capita")+ theme(legend.position="none")
We are especially interested in the variable crim, so it is visualized separately. The crime rate varies a lot between areas, ranging from 0.01 to 88.98. As the plot shows, quite a few high outlier values among the 506 observations pull the mean (3.61) well above the median (0.26). The distribution is strongly skewed to the right.
To explore the relations between the variables of the data set, pairwise scatter plots and a correlation plot are printed.
#scatterplots
pairs(Boston,lower.panel = NULL)
To me these scatter plots are not that informative, though. Thus, I will try another approach, where the correlation chart simultaneously presents the direction (colour) and the magnitude (circle size) of each correlation as well as its value.
#a more visual correlation matrix
cor_matrix<-cor(Boston) %>% round(2)
corrplot.mixed(cor_matrix,number.cex=0.65,tl.cex=0.6)
And, finally, I create a conventional table of correlations with significance-level notation, with the help of this.
library(xtable)
corstars <-function(x, method=c("pearson", "spearman"), removeTriangle=c("upper", "lower"),
result=c("none", "html", "latex")){
#Compute correlation matrix
require(Hmisc)
x <- as.matrix(x)
correlation_matrix<-rcorr(x, type=method[1])
R <- correlation_matrix$r # Matrix of correlation coeficients
p <- correlation_matrix$P # Matrix of p-value
## Define notation for significance levels; spacing is important.
mystars <- ifelse(p < .0001, "****", ifelse(p < .001, "*** ", ifelse(p < .01, "** ", ifelse(p < .05, "* ", " "))))
## truncate the correlation matrix to two decimals
R <- format(round(cbind(rep(-1.11, ncol(x)), R), 2))[,-1]
## build a new matrix that includes the correlations with their appropriate stars
Rnew <- matrix(paste(R, mystars, sep=""), ncol=ncol(x))
diag(Rnew) <- paste(diag(R), " ", sep="")
rownames(Rnew) <- colnames(x)
colnames(Rnew) <- paste(colnames(x), "", sep="")
## remove upper triangle of correlation matrix
if(removeTriangle[1]=="upper"){
Rnew <- as.matrix(Rnew)
Rnew[upper.tri(Rnew, diag = TRUE)] <- ""
Rnew <- as.data.frame(Rnew)
}
## remove lower triangle of correlation matrix
else if(removeTriangle[1]=="lower"){
Rnew <- as.matrix(Rnew)
Rnew[lower.tri(Rnew, diag = TRUE)] <- ""
Rnew <- as.data.frame(Rnew)
}
## remove last column and return the correlation matrix
Rnew <- cbind(Rnew[1:(length(Rnew)-1)])
if (result[1]=="none") return(Rnew)
else{
if(result[1]=="html") print(xtable(Rnew), type="html")
else print(xtable(Rnew), type="latex")
}
}
# Make table with centered columns
corstars(Boston,result="html")
| crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| crim | |||||||||||||
| zn | -0.20**** | ||||||||||||
| indus | 0.41**** | -0.53**** | |||||||||||
| chas | -0.06 | -0.04 | 0.06 | ||||||||||
| nox | 0.42**** | -0.52**** | 0.76**** | 0.09* | |||||||||
| rm | -0.22**** | 0.31**** | -0.39**** | 0.09* | -0.30**** | ||||||||
| age | 0.35**** | -0.57**** | 0.64**** | 0.09 | 0.73**** | -0.24**** | |||||||
| dis | -0.38**** | 0.66**** | -0.71**** | -0.10* | -0.77**** | 0.21**** | -0.75**** | ||||||
| rad | 0.63**** | -0.31**** | 0.60**** | -0.01 | 0.61**** | -0.21**** | 0.46**** | -0.49**** | |||||
| tax | 0.58**** | -0.31**** | 0.72**** | -0.04 | 0.67**** | -0.29**** | 0.51**** | -0.53**** | 0.91**** | ||||
| ptratio | 0.29**** | -0.39**** | 0.38**** | -0.12** | 0.19**** | -0.36**** | 0.26**** | -0.23**** | 0.46**** | 0.46**** | |||
| black | -0.39**** | 0.18**** | -0.36**** | 0.05 | -0.38**** | 0.13** | -0.27**** | 0.29**** | -0.44**** | -0.44**** | -0.18**** | ||
| lstat | 0.46**** | -0.41**** | 0.60**** | -0.05 | 0.59**** | -0.61**** | 0.60**** | -0.50**** | 0.49**** | 0.54**** | 0.37**** | -0.37**** | |
| medv | -0.39**** | 0.36**** | -0.48**** | 0.18**** | -0.43**** | 0.70**** | -0.38**** | 0.25**** | -0.38**** | -0.47**** | -0.51**** | 0.33**** | -0.74**** |
Several relevant correlations between the variables stand out from the information above. Strong negative correlations exist between the weighted mean distance to five Boston employment centres (dis) and the proportion of non-retail business acres per town (indus), nitrogen oxide concentration (nox) and the share of older properties (age). Not surprisingly, there is a strong negative correlation between the lower status of the population (lstat) and the median home value (medv). Strong positive correlations appear especially between the index of accessibility to radial highways (rad) and the full-value property-tax rate (tax), and between rad and indus; indus is further positively correlated with nox and tax. Furthermore, one of our main interests, the crime rate, is correlated with many of the variables: negatively with e.g. distance to employment centres (dis) and median home value (medv), and positively with e.g. the property-tax rate (tax) and access to radial highways (rad). Thus, an increase in crime rate seems to be associated with an increasing highway accessibility index and property tax.
Linear discriminant analysis is a method that generates linear combinations of variables to characterize the classes. To enable the method, the data set needs to be standardized, i.e. each variable is centered to mean zero and scaled to unit variance by subtracting the column mean from each column and dividing the difference by the standard deviation:
\[ scaled(x) = \frac{x - mean(x)}{sd(x)} \]
#scale the dataset
boston_scaled <- as.data.frame(scale(Boston))
After scaling, the values have shrunk compared with the original summary: all mean values are now zero and all standard deviations one.
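As a quick sanity check, this property of scale() can be verified directly:

```r
library(MASS)
data(Boston)
boston_scaled <- as.data.frame(scale(Boston))
# after scaling, every column has mean ~0 and standard deviation exactly 1
all(abs(colMeans(boston_scaled)) < 1e-10)          # TRUE
all(abs(apply(boston_scaled, 2, sd) - 1) < 1e-10)  # TRUE
```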
In addition to scaling, a categorical variable of the scaled crime rate has to be created. Quantiles are used as break points to yield four categories (low, medium low, medium high and high crime rates) and thus four groups with approximately equal numbers of observations each.
Next, the data set is randomly split into a train set (80%) for fitting and a test set (20%) for validation. Thus, the train set has 404 and the test set 102 observations.
# create a quantile vector of crim, and use it to categorize crim
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c('low','med_low','med_high','high'))
# replace the original unscaled variable.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
# explore the categorised variable.
table(boston_scaled$crime)
##
## low med_low med_high high
## 127 126 126 127
To see how the original (unscaled) values differ between the created groups, a summary stratified by crime rate category is created. There are many large differences, especially between the high and low crime rate groups.
Boston2<-Boston
Boston2$crime<-boston_scaled$crime
CreateTableOne(vars=c("zn", "indus","chas","nox" ,"rm"
,"age","dis" ,"rad" , "tax" ,"ptratio", "black"
,"lstat" , "medv"), strata=c("crime"), test=FALSE, data=Boston2)
## Stratified by crime
## low med_low med_high
## n 127 126 126
## zn (mean (sd)) 33.85 (34.67) 9.11 (14.54) 2.41 (6.59)
## indus (mean (sd)) 4.88 (4.28) 9.14 (5.80) 12.41 (6.57)
## chas (mean (sd)) 0.04 (0.20) 0.06 (0.24) 0.12 (0.33)
## nox (mean (sd)) 0.45 (0.05) 0.49 (0.06) 0.60 (0.11)
## rm (mean (sd)) 6.60 (0.56) 6.19 (0.47) 6.36 (0.88)
## age (mean (sd)) 43.65 (19.59) 59.03 (29.05) 80.18 (21.91)
## dis (mean (sd)) 5.64 (2.08) 4.54 (1.93) 3.00 (1.25)
## rad (mean (sd)) 3.54 (1.59) 4.78 (1.50) 5.96 (4.28)
## tax (mean (sd)) 283.83 (63.32) 327.83 (102.02) 356.32 (91.50)
## ptratio (mean (sd)) 17.50 (1.89) 18.32 (1.64) 17.84 (2.86)
## black (mean (sd)) 391.25 (9.19) 386.14 (30.87) 364.64 (56.91)
## lstat (mean (sd)) 7.15 (2.79) 11.71 (5.53) 12.74 (6.91)
## medv (mean (sd)) 27.37 (8.00) 22.51 (5.38) 24.10 (10.28)
## Stratified by crime
## high
## n 127
## zn (mean (sd)) 0.00 (0.00)
## indus (mean (sd)) 18.11 (0.13)
## chas (mean (sd)) 0.06 (0.23)
## nox (mean (sd)) 0.68 (0.06)
## rm (mean (sd)) 6.00 (0.69)
## age (mean (sd)) 91.45 (9.95)
## dis (mean (sd)) 2.00 (0.55)
## rad (mean (sd)) 23.85 (1.69)
## tax (mean (sd)) 663.93 (23.34)
## ptratio (mean (sd)) 20.16 (0.49)
## black (mean (sd)) 284.96 (147.79)
## lstat (mean (sd)) 19.00 (6.85)
## medv (mean (sd)) 16.16 (8.63)
A linear discriminant analysis (LDA) model is fitted to classify the suburbs, using the categorized crime rate as the target variable. The model is first fitted on the training data, and then the classes are predicted on the test data.
set.seed(123)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
Based on the fitted model, the results are plotted first plain and then as an LDA (bi)plot, with the help of a specifically written helper function that adds the arrows. Keep in mind that for plotting, the classes have to be transformed from categorical to numeric.
#helper function for the biplot arrows.
lda.arrows <- function(x, myscale = 2, arrow_heads = 0.2, color = "deeppink", tex = 1, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
#crime classes to numeric for plotting
classes <- as.numeric(train$crime)
#plotting the lda
p1<-plot(lda.fit, dimen = 2, col = classes, pch = classes)
#(bi)plot
p2<-plot(lda.fit, dimen = 2, col = classes, pch = classes)
#arrows
lda.arrows(lda.fit)
print(lda.fit)
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2475248 0.2549505 0.2475248
##
## Group means:
## zn indus chas nox rm
## low 0.9432005 -0.8959713 -0.11640431 -0.8577158 0.44620905
## med_low -0.1512978 -0.3010310 -0.07547406 -0.5591764 -0.11907536
## med_high -0.4023185 0.2609362 0.18636222 0.4392319 0.04779161
## high -0.4872402 1.0149946 -0.07547406 1.0412268 -0.50541663
## age dis rad tax ptratio
## low -0.8546026 0.8232513 -0.6953230 -0.7314517 -0.45097009
## med_low -0.3467956 0.2952511 -0.5420083 -0.4653999 -0.03350366
## med_high 0.4318874 -0.4060664 -0.4087527 -0.2821208 -0.32611400
## high 0.7952399 -0.8387924 1.6596029 1.5294129 0.80577843
## black lstat medv
## low 0.37835870 -0.76633833 0.51252310
## med_low 0.32163001 -0.18506538 0.01078532
## med_high 0.08166164 0.03210543 0.15583308
## high -0.74993410 0.87136020 -0.65964311
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08288224 0.74545012 -1.02758413
## indus 0.03371880 -0.27414286 0.12800485
## chas -0.07688231 -0.04569133 0.03313223
## nox 0.24620783 -0.70692830 -1.31233352
## rm -0.13308714 -0.06060304 -0.15792484
## age 0.29339946 -0.28245910 -0.17691125
## dis -0.04642187 -0.28428720 0.07074842
## rad 3.40029809 0.91155778 -0.10269503
## tax 0.04070331 -0.03257583 0.61283757
## ptratio 0.11422801 0.09094945 -0.21802056
## black -0.15410658 0.03229362 0.15457698
## lstat 0.19460392 -0.28843142 0.29476658
## medv 0.17982251 -0.40689877 -0.20349106
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9510 0.0368 0.0122
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class) %>% addmargins()
## predicted
## correct low med_low med_high high Sum
## low 16 10 0 0 26
## med_low 7 15 4 0 26
## med_high 1 6 15 1 23
## high 0 0 1 26 27
## Sum 24 31 20 27 102
Cross-tabulating the correct and predicted classes shows that the model's predictions are quite accurate for the high category in the test data. In contrast, the low areas are not recognized that well, and both the medium low and medium high classes also seem problematic. Reflecting these results against the plot, the separation is clearest for the highest class, which makes the prediction accuracies understandable.
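The overall accuracy implied by the cross-tabulation above can be computed directly:

```r
# confusion matrix from the output above (rows = correct, cols = predicted)
conf <- matrix(c(16, 10,  0,  0,
                  7, 15,  4,  0,
                  1,  6, 15,  1,
                  0,  0,  1, 26), nrow = 4, byrow = TRUE)
sum(diag(conf)) / sum(conf)  # 72/102, about 0.71
```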
In contrast to LDA, which is a supervised classification method, k-means is an unsupervised method that partitions observations into clusters by minimizing the within-cluster variation.
For k-means clustering, the Boston data set is rescaled so that the distances are comparable. To examine the distance properties of the data and to compare methods superficially, summaries of both the Euclidean (straight-line) and Manhattan (along the axes) distances are printed.
data(Boston)
#center and standardize variables and make it a data frame
boston_scaled<-as.data.frame(scale(Boston))
#Euclidean distance matrix
dist_eu<-dist(boston_scaled)
#for comparison Manhattan distance matrix
dist_man<-dist(boston_scaled,method = "manhattan" )
#summaries
summary(dist_eu)#Euclidean
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
summary(dist_man)#Manhattan
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
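To make the difference between the two metrics concrete, here is a tiny worked example with two points:

```r
x <- c(1, 2); y <- c(4, 6)
sqrt(sum((x - y)^2))   # Euclidean distance: 5
sum(abs(x - y))        # Manhattan distance: 7
# the same values via dist()
as.numeric(dist(rbind(x, y)))                        # 5
as.numeric(dist(rbind(x, y), method = "manhattan"))  # 7
```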
The k-means algorithm is run exploratorily on the data set using 5 clusters.
#kmeans using euclidean and five clusters
km <- kmeans(dist_eu, centers = 5)
pairs(boston_scaled, col = km$cluster,lower.panel = NULL)
At first sight the plotted results look a little like fireworks. Before interpreting the plot in more detail, the optimal number of clusters is estimated using the total within-cluster sum of squares (WCSS), with the number of clusters ranging from 1 to 10, and the results are visualized.
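The quantity monitored here (tot.withinss in the code below) is the total within-cluster sum of squares:

\[ \mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^{2} \]

where \( \mu_k \) is the centroid of cluster \( C_k \). Adding clusters always decreases WCSS, so one looks for the "elbow" where the decrease levels off.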
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
#Plot the results
plot(1:k_max, twcss, type = 'b')
The most prominent drop in the sum of squares happens at 2. There are still drops after two, but they yield only small improvements. Thus, I choose to carry out the k-means clustering again using 2 as the number of clusters.
km <- kmeans(dist_eu, centers = 2)
pairs(boston_scaled, col = km$cluster, lower.panel = NULL)
The new pairwise scatter plot with only two clusters looks better than the previous one. All data points are assigned to one of the two red/black clusters. The clearer the colour separation, the more relevant the variable is for the clustering. Property tax and access to radial highways seem to discriminate quite well between the two clusters.
K-means is performed on the scaled Boston data using 4 clusters. Thereafter, LDA is fitted using the generated clusters as target classes, and a biplot is printed.
set.seed(123)
data(Boston)
boston_scaled4 <- as.data.frame(scale(Boston,center=TRUE,scale = TRUE))
dist_eu4 <- dist(boston_scaled4)
km2 <-kmeans(dist_eu4, centers = 4)
#pairs(boston_scaled4, col = km$cluster, lower.panel = NULL)
boston_scaled4$classes<-km2$cluster
lda.fit2 <- lda(boston_scaled4$classes ~., data = boston_scaled4)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
plot(lda.fit2, dimen = 2, col = as.numeric(boston_scaled4$classes), pch = boston_scaled4$classes)
lda.arrows(lda.fit2, myscale = 2)
The given code is run on the scaled train set. A matrix containing the projections of the data points is created.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
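The matrix multiplication above projects each observation onto the discriminant axes; the dimensions printed earlier make this explicit:

\[ Z_{404 \times 3} = X_{404 \times 13} \, W_{13 \times 3} \]

where \( X \) is the predictor matrix (model_predictors) and \( W \) the matrix of discriminant coefficients (lda.fit$scaling), yielding one LD1/LD2/LD3 coordinate triple per observation.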
A 3D plot without colours is created:
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
A 3D plot is created with the categorized crime rate classes defining the colours:
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crime)
A 3D plot is created with the k-means clusters defining the colours:
train$cluster<-km2$cluster[match(rownames(train),rownames(boston_scaled4))]
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=as.factor(train$cluster))